The data explored in this report was provided by Prosper and contains information on 113,937 loans (data last updated 03/11/2014). The data set contains 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
A preliminary univariate exploration of the data was completed in order to gain an initial understanding of the individual variables within the dataset.
The variables comprise a number of different data types, including continuous values, discrete numeric values, discrete character values and some datetime values. Summaries of the variables explored in this report are shown below.
## ListingCreationDate Term LoanStatus
## Min. :2005-11-09 20:44:28 Min. :12.00 Length:113937
## 1st Qu.:2008-09-19 10:02:14 1st Qu.:36.00 Class :character
## Median :2012-06-16 12:37:19 Median :36.00 Mode :character
## Mean :2011-07-09 08:07:23 Mean :40.83
## 3rd Qu.:2013-09-09 19:40:48 3rd Qu.:36.00
## Max. :2014-03-10 12:20:53 Max. :60.00
##
## BorrowerAPR ProsperRating (numeric) ProsperScore
## Min. :0.00653 Min. :1.000 Min. : 1.00
## 1st Qu.:0.15629 1st Qu.:3.000 1st Qu.: 4.00
## Median :0.20976 Median :4.000 Median : 6.00
## Mean :0.21883 Mean :4.072 Mean : 5.95
## 3rd Qu.:0.28381 3rd Qu.:5.000 3rd Qu.: 8.00
## Max. :0.51229 Max. :7.000 Max. :11.00
## NA's :25 NA's :29084 NA's :29084
## ListingCategory (numeric) BorrowerState Occupation
## Min. : 0.000 Length:113937 Length:113937
## 1st Qu.: 1.000 Class :character Class :character
## Median : 1.000 Mode :character Mode :character
## Mean : 2.774
## 3rd Qu.: 3.000
## Max. :20.000
##
## EmploymentStatus EmploymentStatusDuration CreditScoreRangeLower
## Length:113937 Min. : 0.00 Min. : 0.0
## Class :character 1st Qu.: 26.00 1st Qu.:660.0
## Mode :character Median : 67.00 Median :680.0
## Mean : 96.07 Mean :685.6
## 3rd Qu.:137.00 3rd Qu.:720.0
## Max. :755.00 Max. :880.0
## NA's :7625 NA's :591
## CreditScoreRangeUpper AmountDelinquent BankcardUtilization
## Min. : 19.0 Min. : 0.0 Min. :0.000
## 1st Qu.:679.0 1st Qu.: 0.0 1st Qu.:0.310
## Median :699.0 Median : 0.0 Median :0.600
## Mean :704.6 Mean : 984.5 Mean :0.561
## 3rd Qu.:739.0 3rd Qu.: 0.0 3rd Qu.:0.840
## Max. :899.0 Max. :463881.0 Max. :5.950
## NA's :591 NA's :7622 NA's :7604
## DebtToIncomeRatio TotalProsperLoans OpenRevolvingMonthlyPayment
## Min. : 0.000 Min. :0.00 Min. : 0.0
## 1st Qu.: 0.140 1st Qu.:1.00 1st Qu.: 114.0
## Median : 0.220 Median :1.00 Median : 271.0
## Mean : 0.276 Mean :1.42 Mean : 398.3
## 3rd Qu.: 0.320 3rd Qu.:2.00 3rd Qu.: 525.0
## Max. :10.010 Max. :8.00 Max. :14985.0
## NA's :8554 NA's :91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding LoanOriginalAmount
## Min. : 0 Min. : 0 Min. : 1000
## 1st Qu.: 3500 1st Qu.: 0 1st Qu.: 4000
## Median : 6000 Median : 1627 Median : 6500
## Mean : 8472 Mean : 2930 Mean : 8337
## 3rd Qu.:11000 3rd Qu.: 4127 3rd Qu.:12000
## Max. :72499 Max. :23451 Max. :35000
## NA's :91852 NA's :91852
## MonthlyLoanPayment Investors BorrowerState
## Min. : 0.0 Min. : 1.00 Length:113937
## 1st Qu.: 131.6 1st Qu.: 2.00 Class :character
## Median : 217.7 Median : 44.00 Mode :character
## Mean : 272.5 Mean : 80.48
## 3rd Qu.: 371.6 3rd Qu.: 115.00
## Max. :2251.5 Max. :1189.00
##
For loans whose ‘LoanStatus’ is ‘PastDue’, the status is accompanied by a delinquency bucket indicating how late the payment is. To investigate this further, the variable was split into two: a ‘LoanStatus’ variable and a new variable, ‘PastDueBucket’, which contains the delinquency bucket for ‘PastDue’ loans. In a similar way, ‘ListingCreationDate’ was split into two variables, ‘CreationDate’ and ‘CreationTime’. ‘CreationDate’ was then converted to POSIXct format. These transformations were completed using the code below.
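library(tidyr)  # provides separate()
# Split e.g. "Past Due (1-15 days)" at the brackets into status and bucket.
# fill = "right" pads statuses without a bucket with NA instead of warning.
LoanDataClean <- separate(LoanData, LoanStatus,
                          into = c("LoanStatus", "PastDueBucket"),
                          sep = "[()]", extra = "drop", fill = "right")
# Split the timestamp at the space into a date part and a time part.
LoanDataClean <- separate(LoanDataClean, ListingCreationDate,
                          into = c("CreationDate", "CreationTime"),
                          sep = " ", extra = "drop")
# Convert the date string to POSIXct.
LoanDataClean$CreationDate <- as.POSIXct(LoanDataClean$CreationDate,
                                         format = "%Y-%m-%d")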
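To make the split concrete, here is a minimal base R sketch of the same idea on a few toy status strings. The strings and the variable names `loan_status` and `past_due_bucket` are illustrative assumptions, not taken from the report's code. It also trims the trailing space that splitting at the opening bracket appears to leave on "Past Due ".

```r
# Toy status strings in the same shape as the LoanStatus column (made up).
status <- c("Current", "Completed", "Past Due (1-15 days)", "Past Due (61-90 days)")

# Everything before an optional "(...)" suffix is the status itself.
loan_status <- trimws(sub("\\(.*\\)$", "", status))

# The text inside the brackets is the bucket; NA when there is none.
has_bucket <- grepl("\\(.*\\)$", status)
past_due_bucket <- ifelse(has_bucket,
                          sub("^.*\\((.*)\\)$", "\\1", status),
                          NA_character_)

data.frame(loan_status, past_due_bucket)
```

Trimming matters if the status is later compared against the literal string "Past Due", for example when subsetting.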
The majority of loans are of either current or completed status. Of those loans that are Past Due, a large proportion of them fall within the 1-15 days bucket, indicating that many late payments are made within about two weeks of the due date.
Using the factor function, the numeric values of the variable ‘ListingCategory (numeric)’ were replaced with their corresponding categories, and a new variable ‘ListingCategory’ was created.
LoanDataClean$ListingCategory <-
factor(LoanDataClean$`ListingCategory (numeric)`,
levels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20),
labels = c("Not Available", "Debt Consolidation", "Home Improvement",
"Business", "Personal Loan", "Student Use", "Auto",
"Other", "Baby&Adoption", "Boat", "Cosmetic Procedure",
"Engagement Ring", "Green Loans", "Household Expenses",
"Large Purchases", "Medical/Dental", "Motorcycle", "RV",
"Taxes", "Vacation", "Wedding Loans"))
ggplot(data=LoanDataClean, aes(x=ListingCategory))+
geom_bar()+
theme(axis.text.x=element_text(angle=90,hjust=1, vjust=0.5))
As seen from the graph above, the majority of loans in this dataset were taken in an attempt to consolidate other debt.
The graphic above displays a relatively normal distribution of Prosper ratings given to the loans in this dataset, with possibly a small bias towards less risky loans (indicated by a higher Prosper rating). This distribution is indicative of the business management undertaken by Prosper to manage the risk of the loans on their accounts.
Similarly, this graphic shows a normal distribution trend for the Prosper Score values.
n1 <- ggplot(data=LoanDataClean, aes(x=Investors))+
geom_histogram(binwidth=5)
n2 <- ggplot(data=(subset(LoanDataClean, Investors >= 5)),
aes(x=Investors))+
geom_histogram(binwidth=5)
grid.arrange(n1, n2, ncol=1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
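Although there was a loan with 1189 investors, the majority of loans had far fewer than that, with the 3rd quartile sitting at 115 investors. From the top graph of the panel, it can be seen that the most populated bin is the lowest one, covering loans with fewer than 5 investors. The 1st quartile of 2 investors confirms that at least a quarter of loans fall in this range, although the median of 44 shows that they are not a majority.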
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The minimum loan taken out was $1000 and the maximum $35000. The data is positively skewed, with loan values of $4000, $10000, $15000, $20000 and $25000 showing exceptionally high counts (and to a lesser extent $1000, $2000, $3000 and $5000). This indicates that most loans are rounded to the nearest $1000 and, for loans of $10000 and above, to the nearest $5000.
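The rounding pattern described above can also be checked numerically rather than read off the histogram. The following is a small sketch on made-up loan amounts. The vector `loan_amount` stands in for ‘LoanOriginalAmount’ and its values are invented for illustration.

```r
# Made-up loan amounts, used only to illustrate the check.
loan_amount <- c(1000, 2500, 4000, 4000, 7250, 10000, 15000, 15000, 20000, 25000)

# Share of loans that are exact multiples of $1000.
share_multiple_1000 <- mean(loan_amount %% 1000 == 0)

# Among loans of $10000 or more, share that are exact multiples of $5000.
large <- loan_amount >= 10000
share_multiple_5000 <- mean(loan_amount[large] %% 5000 == 0)

# The most frequent individual amounts, highest count first.
head(sort(table(loan_amount), decreasing = TRUE), 3)
```

Applied to the real column, shares close to 1 would support the reading that borrowers tend to request round amounts.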
A histogram of the variable ‘ProsperPrincipalBorrowed’ shows similar features to that of ‘LoanOriginalAmount’.
The ‘MonthlyLoanPayment’ data is positively skewed with a slightly multimodal distribution, possibly reflecting the tendency of borrowers to take out loans that are multiples of $5000. The monthly payments should reflect the original loan amount; however, there is some spread in the data around the modes, indicating that other variables also affect the monthly payment amount. An exception to this is the large peak seen at around $170. Maybe this is an introductory rate or other special deal?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1627 2930 4127 23451 91852
The dataset contains a variable called ‘ProsperPrincipalOutstanding’, which indicates the amount outstanding in Prosper loans that the borrower already had when taking out the Prosper loan listed. In the summary table above, a missing value (NA, indicative of the borrower having no prior loans) occurred 91852 times, accounting for 81% of cases. Plotting the remaining data for this variable shows a large peak at $0, meaning that many of these borrowers had paid off all prior loans (the 1st quartile is $0, although the median of $1627 shows that more than half still had a balance). Subsetting the data to only include cases in which ProsperPrincipalOutstanding >= 1 enables the distribution of this variable to be seen more clearly.
The data is positively skewed, with the median value falling at $1627 and the 3rd quartile value falling at $4127, compared to a maximum value of $23451. High count peaks occur around $900, $1700, $3300, $4100 and $6200. This is most likely an artifact of the high frequency of certain original loan amounts (see above), possibly combined with another variable.
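The 81% figure follows directly from the counts in the summary output. The sketch below works it through and then shows the same NA handling and subsetting on a toy vector. The vector `outstanding` and its values are made up for illustration.

```r
# Counts taken from the summary output in this report.
n_loans <- 113937
n_missing <- 91852
share_missing <- n_missing / n_loans   # roughly 0.81

# The same calculation on a toy vector (values are made up).
outstanding <- c(NA, NA, NA, 0, 0, 1627, 4127, NA)
mean(is.na(outstanding))               # share of NA values in the toy data

# Keep only borrowers with a known, non-zero outstanding balance.
with_balance <- outstanding[!is.na(outstanding) & outstanding >= 1]
```

Filtering on `!is.na()` explicitly avoids NA entries being carried into the subset by the comparison.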
Prosper appears to offer 3 different term lengths: 12, 36 and 60 months. Borrower APR has a relatively normal distribution. The rate of loan creation increased markedly from 2013 onwards, whilst there was a period in 2009 in which no loans were created at all, most likely a side-effect of the financial crash which occurred in 2008.
It is clear that some states, California in particular, account for far more of the borrowers than others.
This data implies that most borrowers are employed. Further bar charts were created to look at the distribution of occupations that the borrowers listed when taking out the loans.
Those who list their occupation as ‘Other’ or ‘Professional’ account for the vast majority of borrowers in this dataset.
Credit score has a relatively normal distribution. I was interested in seeing if other variables that reflect spending and borrowing traits have a similar trend.
These plots show various trends. The variables ‘AmountDelinquent’, ‘DebtToIncomeRatio’ and ‘OpenRevolvingMonthlyPayment’ all show positively skewed data, with the frequency decreasing as the value increases. The majority of borrowers had no other Prosper loans at the time the listing was created, but of those that did, most had only 1 other Prosper loan. Finally, the ‘BankcardUtilization’ count steadily increases with increasing utilization. The summary data for this variable is below:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
The median ‘BankcardUtilization’ value of 0.600 indicates that half of the borrowers had utilized at least 60% of their available credit when taking out the loan. In fact, 25% of borrowers had over 84% bank card credit utilization. As a note, the maximum value of 5.95 given in the summary table looks anomalous, as utilization is expected to lie between 0 and 1.
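One way to handle the suspicious maximum is to flag out-of-range values rather than silently dropping them. The sketch below does this on made-up data. The vector `utilization` and the 0 to 1 range check are illustrative assumptions, not part of the original analysis.

```r
# Made-up utilization values, including one out-of-range entry and one NA.
utilization <- c(0.05, 0.31, 0.45, 0.60, 0.72, 0.84, 0.97, 5.95, NA)

# Flag values outside the expected 0-1 range without dropping them.
out_of_range <- !is.na(utilization) & (utilization < 0 | utilization > 1)
sum(out_of_range)

# Share of borrowers using more than half of their available credit.
share_over_half <- mean(utilization > 0.5, na.rm = TRUE)
```

Counting the flagged rows first makes it easy to judge whether they are isolated data-entry problems or a systematic feature worth investigating.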
There are 113937 loans with 81 variables on each loan. These variables include those related to the loan in question, for example original loan amount, loan status, Prosper rating and listing category, as well as those related to the borrower, such as borrower state, employment status and occupation.
Most of the loans listed are either current or completed, and the majority of loans were taken out for debt consolidation. The loan amounts often take common values, rounded to the nearest $1000 or, for loans above $10000, to the nearest $5000. The distribution of the original loan amounts in this dataset is highly positively skewed, with the median original loan amount being $6500 whilst the maximum value is $35000.
Most of the borrowers listed in this dataset are employed and list ‘Other’ or ‘Professional’ as their occupation.
There are many features in this dataset; the ones that I am particularly interested in include original loan amount, Prosper rating and Prosper score. I would like to know if there is any correlation between these variables, and possibly other variables within the dataset.
Other than the variables highlighted above, I would also like to investigate if there is any relationship between credit score, debt-to-income ratio, Prosper principal borrowed and monthly loan payments.
I used the separate function from the tidyr library to split the ‘LoanStatus’ variable into a ‘LoanStatus’ variable and a new variable, ‘PastDueBucket’, which contained the delinquency bucket for ‘PastDue’ loans. This enabled the length of time loans had been ‘PastDue’ to be investigated. The separate function was also used to split the ‘ListingCreationDate’ variable into the ‘CreationDate’ and ‘CreationTime’ variables. The ‘CreationDate’ variable was then converted to POSIXct format.
A new variable ‘ListingCategory’ was created by using the factor function to exchange the numeric values in the variable ‘ListingCategory (numeric)’ for their corresponding categories.
Lastly, a new variable ‘CreditScoreRangeMean’ was created using the mean value found from the values of the ‘CreditScoreRangeUpper’ and ‘CreditScoreRangeLower’ variables.
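The report does not show the code for this step, so the following is only a sketch of one plausible way to build ‘CreditScoreRangeMean’ as the midpoint of the two range columns. The data frame `scores` and its rows are made up for illustration.

```r
# Toy rows with the two credit score columns from the dataset (values made up).
scores <- data.frame(
  CreditScoreRangeLower = c(660, 680, 720, NA),
  CreditScoreRangeUpper = c(679, 699, 739, NA)
)

# Midpoint of the reported range; rows with a missing bound stay NA.
scores$CreditScoreRangeMean <-
  (scores$CreditScoreRangeLower + scores$CreditScoreRangeUpper) / 2
```

Because the upper bound sits a fixed distance above the lower bound in the summaries above (679 vs. 660, 699 vs. 680), the midpoint carries the same information as either bound and mainly serves as a single convenient score.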
I log-transformed the ‘EmploymentStatusDuration’ distributions for each employment status, which were all originally positively skewed. This enabled a better comparison between the different distributions for each employment status.
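A practical detail with this transformation is that ‘EmploymentStatusDuration’ has a minimum of 0, and the logarithm of 0 is not finite. The sketch below uses an offset of 1 month to get around this. The offset is an assumption made for illustration; the report does not state how zero durations were handled, and the values in `duration` are made up.

```r
# Made-up durations in months; 0 is a valid value in the data.
duration <- c(0, 3, 12, 26, 67, 137, 755)

# log10(0) is -Inf, so an offset of 1 month keeps zero durations finite.
log_duration <- log10(duration + 1)

# Back-transform to confirm the original values can be recovered.
recovered <- 10^log_duration - 1
```

An alternative is to transform only the plot axis and leave the data untouched, in which case zero durations would need to be dropped or handled separately.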
As noted above, I was interested in looking at the relationship between the variable ‘LoanOriginalAmount’ and other variables in the dataset.
The vertical dark lines at $10000, $15000, $20000 and $25000 represent the high occurrence of loans at these rounded values, as discussed in the Univariate Plots section above. The horizontal lines at $10000, $15000 and $25000 represent the same phenomenon of borrowing rounded values, but in this case for previous Prosper loans.
Plotting the relationship between monthly loan payment and original loan amount results in what looks like 3 different populations of data. This may be interesting to investigate further in the multivariate plot section. Within each of the 3 populations, monthly loan payment and original loan amount show a strong positive correlation, indicating that, as expected, the larger the original loan amount, the larger the monthly loan payment. Again, dark vertical lines at regular loan amounts ($15000, $20000, $25000, etc.) indicate that loans are commonly taken out at these values.
This is an interesting plot. The vertical dark lines at $10000, $15000, $20000 and $25000 are once again visible, indicating the high occurrence of these loan values. There is a faint line visible where x = y, indicating a slight tendency for people to take out a loan equal to the amount they are currently borrowing.
There seems to be a total Prosper loan amount ‘ceiling’ of $35000, as there are no data points for which ‘LoanOriginalAmount’ + ‘ProsperPrincipalOutstanding’ > $35000. A distinct line of data points is visible along this boundary, indicating loans which have been taken out at this ceiling. A similar phenomenon seems to occur at ‘LoanOriginalAmount’ + ‘ProsperPrincipalOutstanding’ = $25000, maybe indicating that the ceiling for an individual borrower is dependent upon a further variable.
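The ceiling can also be checked directly by summing the two columns instead of reading it off the scatter plot. The sketch below does this on made-up rows. The column `TotalProsper` is a hypothetical helper name, and treating a missing outstanding balance as 0 is an assumption made only for this check.

```r
# Made-up loans; the two column names match the dataset.
loans <- data.frame(
  LoanOriginalAmount          = c(10000, 15000, 25000, 4000, 35000),
  ProsperPrincipalOutstanding = c(15000, 20000, NA, 2500, 0)
)

# Treat a missing outstanding balance as "no prior loan" for this check only.
prior <- ifelse(is.na(loans$ProsperPrincipalOutstanding), 0,
                loans$ProsperPrincipalOutstanding)
loans$TotalProsper <- loans$LoanOriginalAmount + prior

# Largest combined amount, and how many loans sit exactly on each candidate ceiling.
max_total <- max(loans$TotalProsper)
at_ceiling <- table(factor(loans$TotalProsper, levels = c(25000, 35000)))
```

On the real data, a maximum of exactly $35000 together with a cluster of loans at $25000 would match the two boundaries visible in the plot.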
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$ProsperScore
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3637793 0.3753979
## sample estimates:
## cor
## 0.369603
Using Pearson’s product-moment correlation, it seems that there is a modest positive correlation (Pearson’s r = 0.370) between a borrower’s mean credit score and their Prosper score.
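The numbers in the test output above are linked by a standard formula: for a Pearson correlation, t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes t from the reported r and df, and also squares r to show how much variance the linear relationship accounts for. The variable names are illustrative.

```r
# Values reported by the correlation test output above.
r  <- 0.369603
df <- 84851

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2).
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in one variable linearly associated with the other.
r_squared <- r^2
```

With `r_squared` at roughly 0.14, the very large t value reflects the sample size more than the strength of the relationship, which is why the correlation is better described by its size than by its p-value.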
In the univariate section of this report I looked at the distribution of the employment status of the borrowers, and found that most borrowers are employed. I wanted to look further into this by studying the distributions in the duration of each employment status.
Comparing the data above, it seems that those working part-time and those that are unemployed are more likely to take out a loan soon after entering these employment statuses.
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$AmountDelinquent
## t = -21.517, df = 106310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07183201 -0.05986200
## sample estimates:
## cor
## -0.06584937
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$OpenRevolvingMonthlyPayment
## t = 47.252, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1332758 0.1446941
## sample estimates:
## cor
## 0.1389896
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$BankcardUtilization
## t = -144.58, df = 106330, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4103489 -0.4003027
## sample estimates:
## cor
## -0.405338
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$DebtToIncomeRatio
## t = -4.2633, df = 104800, p-value = 2.016e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.019221406 -0.007114667
## sample estimates:
## cor
## -0.01316852
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$TotalProsperLoans
## t = -2.6131, df = 113340, p-value = 0.008974
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.013582490 -0.001939898
## sample estimates:
## cor
## -0.007761457
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$CreditScoreRangeMean and LoanDataClean$ProsperPrincipalBorrowed
## t = 16.667, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04363554 0.05525037
## sample estimates:
## cor
## 0.04944463
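It seems that out of the variables studied in the plot above, only bankcard utilization shows a correlation of appreciable size with a borrower’s credit score. The correlation is negative, with a Pearson’s r value of -0.405. All of the tests return small p-values, but with around 100,000 observations even negligible correlations are statistically significant, so the size of r is the more informative measure here.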
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$ProsperScore and LoanDataClean$`ProsperRating (numeric)`
## t = 289.74, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7018231 0.7085876
## sample estimates:
## cor
## 0.7052214
The Prosper score and Prosper rating variables are strongly positively correlated with a Pearson’s r value of 0.705. This is not surprising and it could be that both variables are measuring very similar things, or outcomes of similar models.
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$ProsperScore and LoanDataClean$LoanOriginalAmount
## t = 80.475, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2600308 0.2725335
## sample estimates:
## cor
## 0.2662933
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$ProsperScore and LoanDataClean$Investors
## t = 98.59, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3145438 0.3266177
## sample estimates:
## cor
## 0.3205938
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$ProsperScore and LoanDataClean$AmountDelinquent
## t = -12.128, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04831553 -0.03488190
## sample estimates:
## cor
## -0.04160059
##
## Pearson's product-moment correlation
##
## data: LoanDataClean$ProsperScore and LoanDataClean$DebtToIncomeRatio
## t = -40.909, df = 77555, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1522180 -0.1384397
## sample estimates:
## cor
## -0.1453359
Out of the four variables explored in the plot above, the only one with a Pearson’s r value indicative of an appreciable correlation with Prosper score is the number of investors for a given loan (Pearson’s r value = 0.321).
Some correlation is seen between Prosper score and original loan amount; however, in this case any correlation may be effect rather than cause, with a higher Prosper score enabling larger loan amounts to be agreed.
The relationship between monthly loan payment and original loan amount seemed to show 3 different populations of data. Within the 3 populations, monthly loan payment and original loan amount show a strong positive correlation, in line with a larger original loan amount resulting in a larger monthly loan payment.
Plots of ProsperPrincipalBorrowed vs. LoanOriginalAmount, and ProsperPrincipalOutstanding vs. LoanOriginalAmount, both show vertical bands where many original loan amounts are rounded to whole values of $1000 or $5000.
The plot of ProsperPrincipalOutstanding vs. LoanOriginalAmount was especially interesting, giving evidence of a total Prosper loan ‘cap’ or upper ceiling of $35000. There is also evidence of a similar phenomenon occurring at $25000 in many cases. Finally, this plot showed evidence of a slight tendency for people to take out a loan equal to the amount they are currently borrowing.
There were no variables found that correlated strongly with credit score, with bankcard utilization giving the strongest correlation with a Pearson’s r value of -0.405.
The Prosper score and Prosper rating variables are strongly positively correlated, but are likely to be two different measures of the same thing. No other variable was found to correlate strongly with Prosper score, however, the number of investors did have a small correlation with Prosper score (Pearson’s r = 0.321).
It seems that the models which work out a borrower’s credit score and Prosper score are most likely quite complicated, taking into account many variables, and so no one variable has a big influence.
The Prosper score and Prosper rating variables are strongly positively correlated, giving a Pearson’s r value = 0.705. It is possible that both of these variables are measures of similar things, for example they could be the result of similar models which both make use of the same variables.
It seems that loan term length determines the 3 separate populations which arise in the relationship between monthly loan payment and original loan amount. As expected, the shortest term of 12 months results in the highest monthly payments, as the loan must be paid off more quickly. The scatter in the data may be due to borrowers having slightly different agreement terms, for example, different interest rates, special discounts or special opening rates. These differences should be captured within the variable ‘BorrowerAPR’.
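The three populations are what one would expect if payments follow the standard fixed-payment (annuity) formula, payment = P * i / (1 - (1 + i)^(-n)), where P is the principal, i the monthly rate and n the number of monthly payments. The sketch below assumes this formula and uses the APR divided by 12 as the monthly rate. Both are simplifying assumptions: the dataset does not confirm how Prosper computes payments, and an APR normally includes fees on top of the interest rate. The function name `monthly_payment` is hypothetical.

```r
# Fixed monthly payment for a fully amortizing loan (standard annuity formula).
# principal: amount borrowed; annual_rate: nominal yearly rate (e.g. 0.20);
# term_months: number of monthly payments.
monthly_payment <- function(principal, annual_rate, term_months) {
  monthly_rate <- annual_rate / 12
  principal * monthly_rate / (1 - (1 + monthly_rate)^(-term_months))
}

# Same principal and rate for the three term lengths seen in the data.
terms <- c(12, 36, 60)
payments <- monthly_payment(10000, 0.20, terms)
names(payments) <- paste0(terms, " months")
payments
```

For a fixed term and rate the payment is proportional to the principal, which gives one straight line per term length, while differences in rate move individual loans slightly above or below that line.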
My initial thought looks correct, and the variation within each population does seem to be due to borrowers having different ‘BorrowerAPR’ values, with a lower APR value resulting in lower monthly payments.
It seems that most of the 12 month term loans are now completed; maybe this term length is no longer offered by Prosper for new loans. Most of the 36 and 60 month term loans are still current; however, many of the 36 month term loans at the highest levels of borrower APR are either completed or ‘Chargedoff’ (and so deemed unlikely to be paid), maybe indicating that these are older loans. The majority of the original loan amounts fall below $25000; however, there is a proportion of current loans whose original amounts fall between $25000 and $35000. This is another visualisation of the phenomenon of there being two total loan amount ‘ceilings’ as described in the bivariate analysis section above, and visualised in the plot of ProsperPrincipalOutstanding vs. LoanOriginalAmount.
This plot reaffirmed many of my suspicions about the data. It seems that all loans over $25000 are relatively recent. Those loans with the highest rates of borrower APR seem to be the oldest loans. Contrary to my thoughts that Prosper may no longer offer 12 month loans, it seems that many of these loans are relatively recent. Maybe it is just the fact that these 12 month loans get paid off more quickly that accounts for why a high proportion of them are shown as ‘completed’ in the previous plot.
In line with the previous plot, it seems that having a total Prosper loan amount between $25000 and $35000 only occurs for the most recent loans. Maybe this reflects a recent change in Prosper policy.
Taking out a further loan which pushes a borrower’s total Prosper loan amount over $25000 seems to be available only for loans with a high Prosper rating, indicating a lower risk loan.
This investigation sheds light on some policy changes that Prosper made over the time that this data was collected, such as an increase in total allowed Prosper loan amount from $25000 to $35000 for those loans with high Prosper ratings. This policy change accounted for some unusual features that had been seen in previous graphs.
The variables borrower APR and loan term length were used to explain the 3 populations of data previously seen in the plot of MonthlyLoanPayment vs. LoanOriginalAmount, first explored in the bivariate plots section. More Prosper policy changes were made apparent, such as the decrease in the borrower APR value.
The relationship between loan creation date and total Prosper loan value was unexpected, as I initially did not consider that Prosper policy changes could be unearthed by exploring this dataset. This relationship took my data exploration down a path I did not expect!
I did not create any models with my dataset.
The distribution of employment status duration at the time the listing was created has a slight negative skew on a log scale for all employment status types except the ‘not employed’ and ‘part-time’ types. The ‘part-time’ type shows a more normal distribution and the ‘retired’ type shows a gradual decrease in the number of listings created as the duration of the retirement increases. For both of these status types however, a relatively high number of listings are made within the first year of the status duration. In general, for these two employment status types, a listing is made earlier in the employment status duration than for the other employment status types.
This plot highlights the general trend that loans are often taken out at rounded values (rounded to the nearest $1000 for loans below $10000, and to the nearest $5000 for loans over $10000) as shown by the darker vertical and horizontal lines at $10000, $15000, $20000 and $25000. It also shows that the majority of original loan amounts are below $25000.
This plot of MonthlyLoanPayment vs. LoanOriginalAmount shows the linear fit applied to data with a term length equal to 12, 36 and 60 months, highlighting the 3 different populations of data which result from these term lengths. The data points are also shaded depending upon the value of the variable ‘BorrowerAPR’, indicating the ‘noise’ that this variable creates in the data for each population.
This dataset, provided by Prosper, contains information on 113,937 loans, with details of 81 variables on each loan. The large number of variables meant that I had to be quite selective with what I wanted to investigate. I started by looking at over 20 variables in the univariate part of this report that I thought could be particularly interesting, such as loan status, Prosper rating, Prosper score, original loan amount, employment status and employment status duration, enabling me to build up a picture of the dataset.
In the bivariate analysis part of this report I started to explore relationships between the variables, looking at the relationship between a borrower’s previous loan history and the size of a new loan that a borrower took out. I also explored the relationship between a number of variables and a borrower’s credit score as well as a loan’s Prosper score. Surprisingly to me, I found that in both of these cases no one variable had a strong correlation with either credit score or Prosper score, even variables such as ‘AmountDelinquent’ and ‘DebtToIncomeRatio’ which I assumed would have a large effect. I concluded that the models used to calculate these scores and ratings are complicated, likely with multiple variables that together determine the outcome.
Eventually I considered further the unusually shaped plots that I had uncovered when looking at the relationship between monthly loan payment vs. original loan amount and Prosper principal outstanding vs. original loan amount. In my multivariate analysis I used the variables loan term length, borrower APR, loan status, listing creation date and Prosper rating to make sense of these plots and fully explain their structure. In doing so I uncovered policy changes that Prosper had implemented over the time this dataset was collected. These changes include an increase in the total allowed Prosper loan amount from $25000 to $35000 for those borrowers with high Prosper ratings, as well as changes to the rate of borrower APR.
There is a huge amount of scope to develop this analysis further as many variables have not been explored at all. It would be interesting to look at these variables and see if they, plus the variables explored in this report, could be used to form a model able to predict credit score or Prosper rating. The variables related to the number of credit lines a borrower has, or their number of public records, as well as their payment history, may be particularly interesting to look at.